Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data

Authors

  • Tetsuhiro Miyahara
  • Yusuke Suzuki
  • Takayoshi Shoudai
  • Tomoyuki Uchida
  • Sachio Hirokawa
  • Kenichi Takahashi
  • Hiroaki Ueda
Abstract

Information extraction from semistructured data is becoming increasingly important. To extract meaningful or interesting content from semistructured data, we need to extract structured patterns common to the data. Much semistructured data has irregularities such as missing or erroneous data. A tag tree pattern is an edge-labeled tree with ordered children whose tree structure consists of tags and structured variables. An edge label is a tag, a keyword, or a wildcard, and a variable can be substituted by an arbitrary tree. In particular, a contractible variable matches any subtree, including a singleton vertex. A tag tree pattern is therefore suited to representing common tree-structured patterns in irregular semistructured data. We present a new method for extracting characteristic tag tree patterns from irregular semistructured data, using an algorithm that finds a least generalized tag tree pattern explaining the given data. We report experiments in which this method is applied to extract characteristic tag tree patterns from irregular semistructured data.
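The matching behavior described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it is a simplified model that assumes every variable sits at a leaf position of the pattern, so that a variable stands for a run of consecutive child subtrees (at least one for an ordinary variable, possibly none for a contractible one). The tuple encoding, the `WILDCARD` symbol `"?"`, the function name `matches`, and the sample tags are all hypothetical choices made for illustration.

```python
# Simplified sketch (assumption: all variables are leaf variables).
# Data tree node: a tuple of (edge_label, child_node) pairs, in child order.
# Pattern node: a tuple of pattern edges, each one of
#   ("edge", label, child_pattern)  label is a tag/keyword or WILDCARD
#   ("var", contractible)           contractible is True or False
WILDCARD = "?"  # hypothetical symbol for a wildcard edge label


def matches(pattern, tree):
    """Return True if the ordered child edges of `tree` can be assigned,
    left to right, to the pattern edges of `pattern`."""

    def go(i, j):
        # i: next pattern edge to place, j: next tree edge to consume
        if i == len(pattern):
            return j == len(tree)
        item = pattern[i]
        if item[0] == "var":
            # A contractible variable may consume zero child subtrees;
            # an ordinary variable must consume at least one.
            low = 0 if item[1] else 1
            return any(go(i + 1, j + k) for k in range(low, len(tree) - j + 1))
        _, label, sub = item
        if j == len(tree):
            return False
        t_label, t_child = tree[j]
        if label != WILDCARD and label != t_label:
            return False
        return matches(sub, t_child) and go(i + 1, j + 1)

    return go(0, 0)


# Two records: one complete, one with a missing "phone" field.
full = (("person", (("name", ()), ("phone", ()))),)
missing = (("person", (("name", ()),)),)

# Same pattern shape, once with a contractible and once with an ordinary variable.
pattern_contractible = (("edge", "person", (("edge", "name", ()), ("var", True))),)
pattern_ordinary = (("edge", "person", (("edge", "name", ()), ("var", False))),)

both_explained = matches(pattern_contractible, full) and matches(pattern_contractible, missing)
ordinary_rejects_missing = not matches(pattern_ordinary, missing)
```

In this toy setting the pattern with the contractible variable explains both records, while the pattern with the ordinary variable rejects the record whose field is missing, which is the kind of irregularity the abstract refers to.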

Similar resources

Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Many Web documents such as HTML files and XML files have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge lab...

Full text

Assessing Behavioral Patterns of Motorcyclists Based on Traffic Control Device at City Intersections by Classification Tree Algorithm

According to forensic statistics, motorcyclists have accounted for 26 percent of those killed in traffic accidents in Iran in recent years. Because of this high number of motorcyclist casualties, it is necessary to investigate the causes of motorcycle accidents. Motorcyclists' dangerous behaviors, which are discussed in this study, are among those causes. Traffic signs have the important role of...

Full text

Identification of Patterns and Factors Affecting the Health of Employees Based on Datamining of Occupational Examinations with the Purpose of Promoting Occupational Health

Background and Objective: Attention to the health of workers, who make up a significant part of the population, is important because they play a major role in the development of society; the issue has also caught the attention of government officials and the World Health Organization (WHO). Based on the rules and regulations of workers in different occupations, each year they must undergo certain medic...

Full text

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. Our approach relies mainly on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Full text

Effect of oleo-gum resin extraction periodicity on natural regeneration of wild pistachio trees (case study: wild pistachio forests of Kurdistan Province, Sanandaj)

In order to evaluate the impacts of oleo-gum resin extraction periodicity on natural regeneration of wild pistachio (Pistacia atlantica subsp. kurdica), three different forest areas in Kurdistan province, west of Iran, were selected based on different extraction periodicities (regular periodicity, irregular periodicity and without periodicity). Then homogeneous unit maps were produced in GIS, and on...

Full text

Publication date: 2003